traefik_open_connections metric drifts down until negative #10733

Open
navaati opened this issue May 17, 2024 · 8 comments
Labels: area/middleware/metrics, kind/bug/possible (a possible bug that needs analysis before it is confirmed or fixed)

Comments


navaati commented May 17, 2024

Welcome!

  • Yes, I've searched similar issues on GitHub and didn't find any.
  • Yes, I've searched similar issues on the Traefik community forum and didn't find any.

What did you do?

I configured Prometheus monitoring on a Traefik instance with --metrics.prometheus=true on a separate entrypoint (a separate port), with no other Prometheus-specific configuration, so I get the default set of metrics. I can query /metrics all right, and I would expect traefik_open_connections to stay at 5, since that’s my load: 5 clients from a load testing tool.

What did you see instead?

The value of the traefik_open_connections metric goes down over time until it even becomes negative, which doesn’t make sense for a connection count (how can you have -1 open connections…).

See this excerpt from the /metrics endpoint:

# HELP traefik_open_connections How many open connections exist, by entryPoint and protocol
# TYPE traefik_open_connections gauge
traefik_open_connections{entrypoint="metrics",protocol="TCP"} -1
traefik_open_connections{entrypoint="traefik",protocol="TCP"} 0
traefik_open_connections{entrypoint="web",protocol="TCP"} -1

and this screenshot from Prometheus:
[screenshot: Prometheus graph of traefik_open_connections drifting steadily down below zero]

In the screenshot, the metric is supposed to stay at a constant 5 connections.

As proof that there are actually still 5 connections, here is the output of ss (which also shows the connections to the backend service):

State           Recv-Q           Send-Q                           Local Address:Port                                 Peer Address:Port           Process
ESTAB           0                0                                   172.18.0.3:42488                                  172.18.0.2:8000            users:(("traefik",pid=2506,fd=19))
ESTAB           0                0                                   172.18.0.3:32842                                  172.18.0.2:8000            users:(("traefik",pid=2506,fd=18))
ESTAB           0                0                                   172.18.0.3:38010                                  172.18.0.2:8000            users:(("traefik",pid=2506,fd=17))
ESTAB           0                0                                   172.18.0.3:36600                                  172.18.0.2:8000            users:(("traefik",pid=2506,fd=16))
ESTAB           0                0                                   172.18.0.3:35230                                  172.18.0.2:8000            users:(("traefik",pid=2506,fd=23))
ESTAB           0                0                          [::ffff:172.18.0.3]:80                        [::ffff:217.182.237.24]:57220           users:(("traefik",pid=2506,fd=12))
ESTAB           0                0                          [::ffff:172.18.0.3]:80                        [::ffff:217.182.237.24]:57206           users:(("traefik",pid=2506,fd=10))
ESTAB           0                0                          [::ffff:172.18.0.3]:80                        [::ffff:217.182.237.24]:57250           users:(("traefik",pid=2506,fd=14))
ESTAB           0                0                          [::ffff:172.18.0.3]:80                        [::ffff:217.182.237.24]:57234           users:(("traefik",pid=2506,fd=13))
ESTAB           0                0                          [::ffff:172.18.0.3]:80                        [::ffff:217.182.237.24]:57222           users:(("traefik",pid=2506,fd=11))
ESTAB           0                0                          [::ffff:172.18.0.3]:8089                      [::ffff:217.182.237.24]:34364           users:(("traefik",pid=2506,fd=20))

What version of Traefik are you using?

Version:      3.0.0
Codename:     beaufort
Go version:   go1.22.2
Built:        2024-04-29T14:25:59Z
OS/Arch:      linux/amd64

What is your environment & configuration?

      --entrypoints.web.address=:80
      --entryPoints.web.asDefault=true
      --providers.docker
      --providers.docker.exposedbydefault=false
      --api.dashboard=true
      --api.insecure=true
      --entryPoints.metrics.address=:8089
      --metrics.prometheus=true
      --metrics.prometheus.entryPoint=metrics

The labels on the backend container discovered by the docker provider:

     - "traefik.enable=true"
     - "traefik.http.routers.app.rule=PathPrefix(`/`)"
     - "traefik.http.services.app.loadbalancer.server.port=8000"

The load on the service is a constant 5 clients repeatedly requesting the service, generated by Locust (a Python load testing tool). The backend is a Python app that handles only one request at a time and takes around 110 ms per request, so with a load of 5 clients it handles around 9 req/s with a latency of about 550 ms.

If applicable, please paste the log output in DEBUG level

No response


navaati commented May 18, 2024

Digging a bit more… I don’t really speak Go, but I’ll try… Clearly this function, RemoveConnection, is being called too often, or AddConnection not often enough.


navaati commented May 18, 2024

Errr, so something looks off to me, though it probably isn’t. Reproducing the code of RemoveConnection here:

func (c *connectionTracker) RemoveConnection(conn net.Conn) {
	c.connsMu.Lock()
	delete(c.conns, conn)
	c.connsMu.Unlock()

	if c.openConnectionsGauge != nil {
		c.openConnectionsGauge.Add(-1)
	}
}

Is it normal that the gauge decrement is outside of the mutex section? I guess the instrumentation library is already threadsafe, is it?
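
For concreteness, the variant I have in mind just moves the gauge update inside the critical section. A rough, untested sketch of the same method (field names taken from the snippet above):

func (c *connectionTracker) RemoveConnection(conn net.Conn) {
	c.connsMu.Lock()
	defer c.connsMu.Unlock()

	delete(c.conns, conn)

	// Decrement while still holding the lock, so the map update and the
	// gauge update cannot interleave with a concurrent AddConnection.
	if c.openConnectionsGauge != nil {
		c.openConnectionsGauge.Add(-1)
	}
}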


navaati commented May 18, 2024

@rtribotte I see that you authored this highly relevant commit 7c2af10 from PR #9656, might you be able to chip in ?


thanhphuchuynh commented May 20, 2024

I have this issue too.

My metrics:

# HELP traefik_open_connections How many open connections exist, by entryPoint and protocol
# TYPE traefik_open_connections gauge
traefik_open_connections{entrypoint="metrics",protocol="TCP"} 2
traefik_open_connections{entrypoint="traefik",protocol="TCP"} -26
traefik_open_connections{entrypoint="web",protocol="TCP"} -26
traefik_open_connections{entrypoint="websecure",protocol="TCP"} 8

rtribotte added the area/middleware/metrics and kind/bug/possible (a possible bug that needs analysis before it is confirmed or fixed) labels and removed the status/0-needs-triage label on May 23, 2024
rtribotte self-assigned this on May 23, 2024
rtribotte (Member) commented

Hello @navaati, thanks for opening this!

Is it normal that the gauge decrement is outside of the mutex section? I guess the instrumentation library is already threadsafe, is it?

It should indeed be threadsafe (https://github.com/prometheus/client_golang/blob/release-1.17/prometheus/gauge.go#L122C1-L130C2). Have you had a chance to try the alternative (protecting it with the mutex)? Did it fix the bug?
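
For what it’s worth, concurrent Add calls on a client_golang gauge should not drift on their own. A quick standalone sanity check along these lines (just a sketch using the official client and its testutil package, not Traefik code) should always print exactly 0:

package main

import (
	"fmt"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func main() {
	g := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "open_connections_check",
		Help: "Concurrency sanity check for Gauge.Add.",
	})

	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); g.Add(1) }()  // stands in for AddConnection
		go func() { defer wg.Done(); g.Add(-1) }() // stands in for RemoveConnection
	}
	wg.Wait()

	// If Gauge.Add is safe under concurrent use, the final value is exactly 0.
	fmt.Println(testutil.ToFloat64(g))
}

If that holds, the drift is more likely about which connections get counted (the AddConnection vs RemoveConnection call sites) than about the gauge itself.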


navaati commented May 23, 2024

Hi.

No, I haven’t tested anything yet: I first need to figure out how to build the binary and then the container, for which I hadn’t found instructions (at least nothing in the contributing documentation). EDIT: found the docs! https://doc.traefik.io/traefik/contributing/building-testing/

I’ll try to test it, although given the link you dug up I don’t have much hope. Is it actually the official Prometheus client being used here, though? I see the type of the gauge is gokitmetrics.Gauge.


navaati commented May 24, 2024

Yeah no, it’s just an abstraction layer over the official prom client: https://github.com/go-kit/kit/blob/dfe43fa6a8d72c23e2205d0b80e762346e203f78/metrics/prometheus/prometheus.go#L84.
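
So the call chain is roughly gokitmetrics.Gauge.Add → client_golang GaugeVec.With(...).Add(delta). A minimal sketch of how that wrapper is used (the metric name and label values here are made up for illustration):

package main

import (
	"fmt"

	gokitprometheus "github.com/go-kit/kit/metrics/prometheus"
	stdprometheus "github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func main() {
	// The underlying vector comes from the official client_golang…
	gv := stdprometheus.NewGaugeVec(stdprometheus.GaugeOpts{
		Name: "demo_open_connections",
		Help: "Illustration of the go-kit wrapper over the official client.",
	}, []string{"entrypoint", "protocol"})

	// …and go-kit only wraps it behind its metrics.Gauge interface.
	g := gokitprometheus.NewGauge(gv)

	// go-kit takes label key/value pairs; Add forwards straight to the
	// client_golang gauge child, so thread safety is the official client’s.
	g.With("entrypoint", "web", "protocol", "TCP").Add(1)
	g.With("entrypoint", "web", "protocol", "TCP").Add(-1)

	fmt.Println(testutil.ToFloat64(gv.WithLabelValues("web", "TCP"))) // prints 0
}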


navaati commented May 28, 2024

I just realized there was a major Træfik version release during my development cycle! That’ll teach me for not pinning containers…

Anyway, just for the record, I tested with container tag v2.11.3 and the result is much more stable (although I understand there were other problems with the former traefik_entrypoint_open_connections metric compared to the new traefik_open_connections one):
[screenshot: Prometheus graph with v2.11.3, where the open connections metric holds steady]

Now I’ll test a modified v3.0 with the +1/-1 in the mutex critical section.

Development

No branches or pull requests

4 participants